Load Dataset

Data Info & Summary

df1: Raw Data

df2: Cleaned Data Ver.1a - Label Encoded

df2d: Cleaned Data Ver.1b - Dummy Variables (not used for Modeling)

df3: Cleaned Data Ver.2 - Label Encoded

df4: Cleaned Data Ver.3 - Final

Price vs Selected Predictors

Price vs Manufacturer

Price ranges differ noticeably across car manufacturers. However, the range and variance within each manufacturer are very wide. This is most likely because each manufacturer offers many models (resulting in a large number of car types), and each model may have very different specifications. Hence, we do not include manufacturer or model as predictors in our model.

Price vs Production Year

There is no very clear relationship, but there is some indication that cars from later production years tend to have higher prices and a wider price range.

Price vs MSRP

Price vs State

Prices vary by state: some states have a narrower price range than others.

Price vs Odometer

Even at lower odometer readings, the price range remains wide.

Other Relevant Categorical Variables: Distribution

Condition

Cylinders

Fuel

Title Status

Transmission

Drive

Size (dropped during data cleaning because more than 70% of its values are null)

Type

Paint Color

State

For any null values, we can try to impute them by matching on the manufacturer or car model.
Alternatively, based on the distributions of the categorical variables above, any remaining null values can be filled with the mode of each categorical variable.
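The two imputation ideas above can be combined, as in the minimal sketch below. It is an illustration only, not the notebook's actual cleaning code: the column names `manufacturer` and `drive`, the helper `impute_categorical`, and the toy data are all assumptions. Nulls are first filled with the mode within each group, then any remaining nulls fall back to the overall mode of the column.

```python
import pandas as pd


def impute_categorical(df, col, group_col):
    """Fill nulls in `col` with the per-group mode, then the overall mode.

    Hypothetical helper: `col` and `group_col` are assumed column names.
    """
    out = df.copy()

    # Mode of `col` within each group, computed on non-null rows only,
    # so every remaining group has at least one observed value.
    group_modes = (
        out.dropna(subset=[col])
        .groupby(group_col)[col]
        .agg(lambda s: s.mode().iloc[0])
    )

    # Step 1: fill from the matching group (e.g. same manufacturer).
    out[col] = out[col].fillna(out[group_col].map(group_modes))

    # Step 2: fall back to the overall mode for anything still null.
    overall = out[col].mode(dropna=True)
    if not overall.empty:
        out[col] = out[col].fillna(overall.iloc[0])
    return out


# Toy data (made up for illustration).
toy = pd.DataFrame(
    {
        "manufacturer": ["ford", "ford", "ford", "bmw", "bmw", "kia"],
        "drive": ["4wd", "4wd", None, "rwd", None, None],
    }
)
filled = impute_categorical(toy, col="drive", group_col="manufacturer")
print(filled)
```

In the toy data, the `kia` row has no observed value within its own group, so it receives the overall mode. Ties between equally frequent values are broken by sort order, which may or may not be the desired behaviour.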

Clustering

We will cluster the cleaned dataset to understand its existing groupings, which can be interpreted as market segments of used car listings.

t-SNE

There doesn't seem to be a clear distinction between different groups/clusters.

We can try other methods to see whether they yield more meaningful/distinctive groupings.

DBSCAN

We will use eps=1 and min_samples=25, which gives 6 clusters plus 1 outlier (noise) group.
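As a rough sketch of how the cluster and outlier counts can be read off a DBSCAN fit, the snippet below uses scikit-learn's `DBSCAN` with the same `eps` and `min_samples` values. The synthetic data and the helper `summarize_dbscan` are assumptions made for illustration; the features are assumed to be on comparable scales already, and the counts here do not reproduce the notebook's result. scikit-learn labels noise points as `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def summarize_dbscan(X, eps=1.0, min_samples=25):
    """Fit DBSCAN and return (labels, number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # Label -1 marks noise (outliers); every other label is a cluster.
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return labels, n_clusters, n_noise


# Synthetic stand-in data: three tight, well-separated blobs plus two isolated points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
blobs = [rng.normal(loc=c, scale=0.05, size=(100, 2)) for c in centers]
isolated = np.array([[50.0, 50.0], [-50.0, -50.0]])
X = np.vstack(blobs + [isolated])

labels, n_clusters, n_noise = summarize_dbscan(X, eps=1.0, min_samples=25)
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Because `eps` is a distance threshold, the result depends strongly on feature scaling, so the same settings can give quite different cluster counts on differently scaled data.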

The cluster separation is not clear, so we will try another clustering method, K-Means.

K-Means

The cluster separation is quite distinct. We will apply these cluster labels to the remaining variables.

Because the notebook was rerun, the order of the clusters has changed from the one displayed in the slides (previous version).

In the clustering below:
cluster0 = cluster1 in the slides/previous version: "The Mid Class"
cluster1 = cluster0 in the slides/previous version: "The Lower Class"
cluster2 = cluster2 in the slides/previous version: "The Higher Class"
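One way to keep this mapping explicit is to store it as a dictionary and attach the segment names to the cluster labels, as in the sketch below. The dictionary contents come from the mapping above; the variable names and the example label values are assumptions for illustration.

```python
import pandas as pd

# Current cluster id -> segment name (from the mapping above).
SEGMENT_NAMES = {0: "The Mid Class", 1: "The Lower Class", 2: "The Higher Class"}

# Current cluster id -> cluster id used in the slides/previous version.
CURRENT_TO_SLIDE = {0: 1, 1: 0, 2: 2}

# Made-up example labels, standing in for the K-Means output.
cluster = pd.Series([0, 1, 2, 1, 0], name="cluster")

segments = pd.DataFrame(
    {
        "cluster": cluster,
        "slide_cluster": cluster.map(CURRENT_TO_SLIDE),
        "segment": cluster.map(SEGMENT_NAMES),
    }
)
print(segments)
```

Referring to segments by name rather than by numeric id avoids confusion when the ids change between runs. Passing a fixed `random_state` to scikit-learn's `KMeans` may also help keep the ids stable across reruns on the same data.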

Price Depreciation

Car Price by Manufacturer

Car Price by Model

Average Car Price from the Clean Dataset

Geo Location Visualization

Visualizing the longitude and latitude values from the raw data.
The data is not clean: it is supposed to cover only the US, but some points fall outside it.
These variables are not used as predictors.
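If the coordinates were to be cleaned, a simple bounding-box check could flag points outside the US, as sketched below. Everything here is an assumption for illustration: the column names `lat` and `long`, the helper `flag_in_bounds`, and the approximate box for the contiguous US, which excludes Alaska and Hawaii and includes parts of neighbouring countries.

```python
import pandas as pd

# Rough bounding box for the contiguous US (approximate, assumed values).
US_BOUNDS = {"lat": (24.0, 50.0), "long": (-125.0, -66.0)}


def flag_in_bounds(df, lat_col="lat", lon_col="long", bounds=US_BOUNDS):
    """Return a boolean Series: True where the point lies inside the box.

    Rows with missing coordinates are flagged False.
    """
    lat_min, lat_max = bounds["lat"]
    lon_min, lon_max = bounds["long"]
    return df[lat_col].between(lat_min, lat_max) & df[lon_col].between(lon_min, lon_max)


# Made-up example points: New York, Los Angeles, London, and a missing value.
points = pd.DataFrame(
    {
        "lat": [40.7, 34.05, 51.5, None],
        "long": [-74.0, -118.24, -0.12, None],
    }
)
in_us = flag_in_bounds(points)
print(points[in_us])
```

A rectangular box is only a coarse filter. A more careful check would test points against actual state boundaries.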